Introduction

The National Basketball Association (NBA) is the premier basketball league in the world, consisting of 30 teams (29 based in the USA and one in Canada) divided into two conferences of 15 teams each. Each season is split into two phases:

  • the regular season, when each team plays 82 games;
  • the playoffs, when only the top 8 teams in each conference compete for the NBA championship title.

At the end of each season, following a vote conducted among more than 100 media members, various awards are conferred. Among these, the most coveted is undoubtedly the MVP (Most Valuable Player), which is awarded to the best player of the regular season.

The goal of this project is to build a model that predicts the MVP of the 2018-19 NBA season from player stats. Since the season is still in progress, the results are necessarily partial: the project is best read as a prediction of who would win the MVP award if the season had ended on January 12, 2019.

Over the years, much has been said about the criteria that should be used to determine the MVP (for example, how much team results or the number of games played should count). Another reason for carrying out this analysis was therefore to study past ballots and determine which variables voters weighed most heavily when casting their votes.

Specifically, voting takes place as follows: each voter ranks five players, from first to fifth place. A first-place vote is worth 10 points, a second-place vote 7, a third-place vote 5, a fourth-place vote 3 and a fifth-place vote 1. The player with the highest total score is named MVP of the regular season.
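
The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the ballots are hypothetical (the player names are placeholders, not real votes), and each ballot is assumed to be a list of names ordered from first to fifth place.

```python
# Points awarded for 1st through 5th place, as described in the text.
POINTS_BY_PLACE = (10, 7, 5, 3, 1)


def tally_ballots(ballots):
    """Return a dict mapping each player to his total voting points.

    Each ballot is a sequence of up to five player names, ordered from
    first place to fifth place (an assumed input format).
    """
    totals = {}
    for ballot in ballots:
        for place, player in enumerate(ballot[:5]):
            totals[player] = totals.get(player, 0) + POINTS_BY_PLACE[place]
    return totals


def winner(totals):
    """Return the player with the highest total score."""
    return max(totals, key=totals.get)


# Hypothetical ballots with placeholder names, for illustration only.
example_ballots = [
    ["Player A", "Player B", "Player C", "Player D", "Player E"],
    ["Player B", "Player A", "Player C", "Player E", "Player D"],
    ["Player A", "Player C", "Player B", "Player D", "Player E"],
]
example_totals = tally_ballots(example_ballots)
```

With these three made-up ballots, "Player A" collects two first-place votes and one second-place vote, for 10 + 10 + 7 = 27 points, and would be the winner.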

The data were downloaded from Basketball Reference, with the aim of building a dataset of the best individual seasons by NBA players over the years. The initial idea was to start from the 1980-81 season, because that is the first year in which the MVP was voted on by media members (previously the winner was selected by the players). In the end, the analysis starts from the following season (1981-82), since not all the stats were available for 1980-81.

Exploratory Analysis

A regression model was used to predict the total score each player will get in the vote. Before describing the results, the most interesting plots from the exploratory analysis are shown.

The first one highlights something that is probably obvious to anyone who follows the NBA, but might surprise everyone else. The following boxplots show that MVP winners are above average in two stats that are certainly not positive: missed shots per game and turnovers per game.

The explanation is actually very simple: as the main players on their teams, MVPs usually attempt the most field goals, so it is normal that they also miss the most shots (plot on the left).

As for turnovers, the reason is that MVPs handle the ball much more than their teammates, so it is not surprising that they lead their teams in this stat as well.

It is thus clear that even the “negative” variables can be useful in predicting whether a player will be elected MVP at the end of the season.
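
As a small illustration of the "missed shots per game" variable, the sketch below derives it from field goals made and attempted per game. It assumes the per-game columns are available as plain numbers; the function name and the example values are hypothetical and not taken from the dataset.

```python
def missed_shots_per_game(fga_per_game, fg_per_game):
    """Missed field goals per game: attempts minus made shots.

    Both arguments are per-game averages (assumed inputs).
    """
    if fg_per_game > fga_per_game:
        raise ValueError("made shots cannot exceed attempted shots")
    return fga_per_game - fg_per_game


# Hypothetical high-usage player: 20.0 attempts and 9.5 makes per game.
example_missed = missed_shots_per_game(20.0, 9.5)
```

A player who shoots this often misses 10.5 shots per game even while making 47.5% of his attempts, which is why a high number of misses tends to signal a large offensive role rather than poor play.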


The following animated boxplot shows that advanced stats are increasingly taken into account when evaluating a player (the term “advanced” refers to statistics created ad hoc to measure the efficiency of a single player or his contribution to the success of the team). This plot refers to BPM (Box Plus/Minus), but the results are similar for WS/48 (Win Shares per 48 minutes) and PER (Player Efficiency Rating).

What stands out is how strongly this stat seems to separate MVPs from other players, especially in recent years. The likely reason is that these stats are quite new: the values for earlier seasons were calculated retroactively, so voters at the time could not have consulted them before casting their ballots, even if they had wanted to.

Looking at the various plots also gives an idea of which explanatory variables might be the most important in predicting the target variable. The target is called MVP Share and is the ratio between the total points obtained by a player in the voting and the highest achievable score.
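
To make the target concrete, here is a minimal sketch of the MVP Share calculation. It assumes the highest achievable score is reached when every voter gives the player a first-place vote (10 points each); the function names and the example numbers are hypothetical.

```python
FIRST_PLACE_POINTS = 10  # value of a first-place vote


def max_points(n_voters):
    """Highest achievable score: every voter ranks the player first."""
    return FIRST_PLACE_POINTS * n_voters


def mvp_share(points_won, n_voters):
    """Ratio between the points a player obtained and the maximum possible."""
    if n_voters <= 0:
        raise ValueError("n_voters must be positive")
    return points_won / max_points(n_voters)


# Hypothetical example: 850 points from a panel of 100 voters.
example_share = mvp_share(850, 100)
```

In this made-up case the share is 850 / 1000 = 0.85. Because the share is normalized by the size of the panel, seasons with different numbers of voters can be compared on the same 0 to 1 scale.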

The first thing that stands out here is that there seems to be a correlation between points scored and the target. The plot shows that the only player to win the MVP while scoring fewer than 20 points per game was Steve Nash, who did it twice (his number of assists, however, was very high).

As for the number of games played, it is clear that players who miss many games are penalized: no one has ever been elected MVP after playing fewer than 85% of the games (keeping in mind that the analysis does not cover the years before the 1981-82 season).
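
The percentage of games played can be computed as in the sketch below. The 85% threshold comes from the observation above; the function names are hypothetical, and dividing by the games a team has played so far (rather than by the full 82) is an assumption about how a mid-season value might be obtained.

```python
GAMES_IN_FULL_SEASON = 82
MIN_PLAYED_FRACTION = 0.85  # lowest share of games among past MVPs, per the text


def played_fraction(games_played, team_games=GAMES_IN_FULL_SEASON):
    """Fraction of the team's games in which the player appeared."""
    if team_games <= 0 or not 0 <= games_played <= team_games:
        raise ValueError("invalid game counts")
    return games_played / team_games


def below_historical_floor(games_played, team_games=GAMES_IN_FULL_SEASON):
    """True if the player has played a smaller share of games than any past MVP."""
    return played_fraction(games_played, team_games) < MIN_PLAYED_FRACTION
```

Over a full season, 70 games out of 82 (about 85.4%) stays above the threshold, while 69 games (about 84.1%) falls below it. Mid-season, a hypothetical player with 30 appearances in his team's first 42 games would be at about 71.4%.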

There also seems to be a strong correlation between team winning percentage and the number of votes received: in the years considered, only Russell Westbrook in 2017 and Moses Malone in 1982 won the MVP award while playing for a team that won fewer than 60% of its games.


The following animated scatterplot shows that in recent years superstars have been playing fewer minutes than in the past, due to the increasing attention paid to the physical well-being of players. This may also be why the number of minutes played per game is not particularly correlated with the target variable.


Results

The variable importance estimates for the final model are very reasonable: in order, PER, team win percentage, BPM, WS/48, percentage of games played and points per game were the most useful variables in the prediction. The main thing to point out here is how much weight voters give to two variables that do not concern individual performance: team results and number of games played. Furthermore, the advanced stats (PER, WS/48, BPM) once again prove to be more useful than the “traditional” ones (points, rebounds, assists).


The following plot gives an initial idea of the results: it shows the gaps between the top five predicted vote-getters at the time of the analysis.

N.B. Due to space constraints, Giannis Antetokounmpo is labeled by his first name, unlike the other players.

Images taken from www.tsn.ca


The following bar chart, on the other hand, gives a more general view of the results: it shows the top 10 players ordered by the points they are predicted to get in the voting.

At the moment it seems to be a two-way race between Giannis Antetokounmpo and James Harden. Anthony Davis, who is probably having the best season from a statistical point of view, is penalized by the results of his team.

For those interested in the full results, the following table lists, together with the main stats, the predicted scores of all the players who, according to the model, will receive at least one vote at the end of the season.

Conclusion

The model predicts that Giannis Antetokounmpo will be the NBA MVP for the 2018-19 season; moreover, the analysis showed that the factors most considered by the voters are:

  • advanced stats (especially recently);
  • team performance;
  • number of games played.

Finally, an observation: those who follow the NBA may be surprised by some results, such as the 10th position predicted for Stephen Curry. He is penalized for having played fewer games than the other players considered, so if he doesn’t get injured again this season, it is reasonable to expect him to climb several positions. Similarly, if a player (whoever he is) doesn’t play the last 30 games, he probably won’t even get a vote. It is worth remembering that, since the analysis is based on mid-season stats, its results are inevitably partial.